authorship attribution
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > United States > Arizona > Maricopa County > Tempe (0.04)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.28)
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- Europe > Romania > București - Ilfov Development Region > Municipality of Bucharest > Bucharest (0.05)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Law > Criminal Law (0.93)
Large Language Models and Forensic Linguistics: Navigating Opportunities and Threats in the Age of Generative AI
Large language models (LLMs) present a dual challenge for forensic linguistics. They serve as powerful analytical tools enabling scalable corpus analysis and embedding-based authorship attribution, while simultaneously destabilising foundational assumptions about idiolect through style mimicry, authorship obfuscation, and the proliferation of synthetic texts. Recent stylometric research indicates that LLMs can approximate surface stylistic features yet exhibit detectable differences from human writers, a tension with significant forensic implications. However, current AI-text detection techniques, whether classifier-based, stylometric, or watermarking approaches, face substantial limitations: high false positive rates for non-native English writers and vulnerability to adversarial strategies such as homoglyph substitution. These uncertainties raise concerns under legal admissibility standards, particularly the Daubert and Kumho Tire frameworks. The article concludes that forensic linguistics requires methodological reconfiguration to remain scientifically credible and legally admissible. Proposed adaptations include hybrid human-AI workflows, explainable detection paradigms beyond binary classification, and validation regimes measuring error and bias across diverse populations. The discipline's core insight, i.e., that language reveals information about its producer, remains valid but must accommodate increasingly complex chains of human and machine authorship.
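The homoglyph substitution mentioned in the abstract can be illustrated with a minimal sketch. The character map, the function names, and the mixed-script heuristic below are illustrative assumptions, not a method taken from the article: a handful of Latin letters are swapped for visually similar Cyrillic code points, which leaves the text looking the same to a reader while changing the characters a detector actually sees.

```python
# Hypothetical map of Latin letters to look-alike Cyrillic code points.
HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "c": "\u0441",  # Cyrillic small es
}
REVERSE = {fake: real for real, fake in HOMOGLYPHS.items()}


def substitute(text):
    """Replace mapped Latin letters with their look-alikes."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def normalize(text):
    """Undo the substitution for the characters in the map only."""
    return "".join(REVERSE.get(ch, ch) for ch in text)


def mixed_script_words(text):
    """Flag words that mix ASCII letters with non-ASCII letters."""
    flagged = []
    for word in text.split():
        has_ascii = any(ch.isascii() and ch.isalpha() for ch in word)
        has_other = any(not ch.isascii() and ch.isalpha() for ch in word)
        if has_ascii and has_other:
            flagged.append(word)
    return flagged


original = "please compare responses"
disguised = substitute(original)
```

Here `disguised` renders almost identically to `original` but compares unequal, so any feature built on exact characters or tokens shifts. The mixed-script check is only a crude heuristic: it would also flag legitimate words that combine scripts, and it misses words replaced in full.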
- North America > United States (0.14)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Provence-Alpes-Côte d'Azur > Alpes-Maritimes > Nice (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Education > Educational Setting > Online (0.46)
The Hidden DNA of LLM-Generated JavaScript: Structural Patterns Enable High-Accuracy Authorship Attribution
Tihanyi, Norbert, Cherif, Bilel, Dubniczky, Richard A., Ferrag, Mohamed Amine, Bisztray, Tamás
In this paper, we present the first large-scale study exploring whether JavaScript code generated by Large Language Models (LLMs) can reveal which model produced it, enabling reliable authorship attribution and model fingerprinting. With the rapid rise of AI-generated code, attribution is playing a critical role in detecting vulnerabilities, flagging malicious content, and ensuring accountability. While AI-vs-human detection usually treats AI as a single category, we show that individual LLMs leave unique stylistic signatures, even among models belonging to the same family or parameter size. To this end, we introduce LLM-NodeJS, a dataset of 50,000 Node.js back-end programs from 20 large language models. Each program has four transformed variants, yielding 250,000 unique JavaScript samples (the original plus its four variants) and two additional representations (JSIR and AST) for diverse research applications. Using this dataset, we benchmark traditional machine learning classifiers against fine-tuned Transformer encoders and introduce CodeT5-JSA, a custom architecture derived from the 770M-parameter CodeT5 model with its decoder removed and a modified classification head. It achieves 95.8% accuracy on five-class attribution, 94.6% on ten-class, and 88.5% on twenty-class tasks, surpassing other tested models such as BERT, CodeBERT, and Longformer. We demonstrate that classifiers capture deeper stylistic regularities in program dataflow and structure, rather than relying on surface-level features. As a result, attribution remains effective even after mangling, comment removal, and heavy code transformations. To support open science and reproducibility, we release the LLM-NodeJS dataset, Google Colab training scripts, and all related materials on GitHub: https://github.com/LLM-NodeJS-dataset.
- North America > United States > New York > New York County > New York City (0.05)
- Europe > Norway > Eastern Norway > Oslo (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
A Comparative Study of LLM Prompting and Fine-Tuning for Cross-genre Authorship Attribution on Chinese Lyrics
Li, Yuxin, Xu, Lorraine, Wang, Meng Fan
We propose a novel study on authorship attribution for Chinese lyrics, a domain where clean, public datasets are sorely lacking. Our contributions are twofold: (1) we create a new, balanced dataset of Chinese lyrics spanning multiple genres, and (2) we develop and fine-tune a domain-specific model, comparing its performance against zero-shot inference using the DeepSeek LLM. We test two central hypotheses. First, we hypothesize that a fine-tuned model will outperform a zero-shot LLM baseline. Second, we hypothesize that performance is genre-dependent. Our experiments strongly confirm Hypothesis 2: structured genres (e.g. Folklore & Tradition) yield significantly higher attribution accuracy than more abstract genres (e.g. Love & Romance). Hypothesis 1 receives only partial support: fine-tuning improves robustness and generalization in Test1 (real-world data and difficult genres), but offers limited or ambiguous gains in Test2, a smaller, synthetically-augmented set. We show that the design limitations of Test2 (e.g., label imbalance, shallow lexical differences, and narrow genre sampling) can obscure the true effectiveness of fine-tuning. Our work establishes the first benchmark for cross-genre Chinese lyric attribution, highlights the importance of genre-sensitive evaluation, and provides a public dataset and analytical framework for future research. We conclude with recommendations: enlarge and diversify test sets, reduce reliance on token-level data augmentation, balance author representation across genres, and investigate domain-adaptive pretraining as a pathway for improved attribution performance.
- North America > United States (0.04)
- Asia > Singapore (0.04)
- Asia > China (0.04)
BARD10: A New Benchmark Reveals Significance of Bangla Stop-Words in Authorship Attribution
Moosa, Abdullah Muhammad, Sultana, Nusrat, Moosa, Mahdi Muhammad, Hossain, Md. Miraiz
This research presents a comprehensive investigation into Bangla authorship attribution, introducing a new balanced benchmark corpus, BARD10 (Bangla Authorship Recognition Dataset of 10 authors), and systematically analyzing the impact of stop-word removal across classical and deep learning models to uncover the stylistic significance of Bangla stop-words. BARD10 is a curated corpus of Bangla blog and opinion prose from ten contemporary authors. The study methodically assesses four representative classifiers, SVM (Support Vector Machine), Bangla BERT (Bidirectional Encoder Representations from Transformers), XGBoost, and an MLP (Multilayer Perceptron), using uniform preprocessing on both BARD10 and the benchmark corpus BAAD16 (Bangla Authorship Attribution Dataset of 16 authors). On both datasets, the classical TF-IDF + SVM baseline performed best, attaining a macro-F1 score of 0.997 on BAAD16 and 0.921 on BARD10, while Bangla BERT lagged by as much as five points. The study reveals that BARD10 authors are highly sensitive to stop-word pruning, while BAAD16 authors remain comparatively robust, highlighting genre-dependent reliance on stop-word signatures. Error analysis revealed that high-frequency components transmit authorial signatures that are diminished by transformer models. Three insights are identified: Bangla stop-words serve as essential stylistic indicators; finely calibrated ML models prove effective within short-text limitations; and BARD10 connects formal literature with contemporary web dialogue, offering a reproducible benchmark for future long-context or domain-adapted transformers.
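Why stop-word pruning can erase authorial signal is easier to see on a toy case. The sketch below is an assumption-laden illustration, not the paper's pipeline: it uses hand-rolled TF-IDF with a nearest-profile cosine match in place of the TF-IDF + SVM baseline, and invented English sentences in place of BARD10 or BAAD16. The two hypothetical authors share all content words and differ only in function words.

```python
import math
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    return [tok for tok in text.lower().split() if tok not in stop_words]


def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf(doc, idf):
    """L2-normalized TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}


def author_scores(train, query, stop_words=frozenset()):
    """Cosine similarity between the query and each author's profile."""
    docs = {author: tokenize(text, stop_words) for author, text in train.items()}
    idf = fit_idf(list(docs.values()))
    q = tfidf(tokenize(query, stop_words), idf)
    scores = {}
    for author, doc in docs.items():
        profile = tfidf(doc, idf)
        scores[author] = sum(w * profile.get(tok, 0.0) for tok, w in q.items())
    return scores


# Invented toy data: same content words, different function words.
train = {
    "A": "the river and the hills of the valley and the town",
    "B": "a river but a hills so a valley but a town",
}
query = "the bridge of the river and the town"
stop_words = frozenset({"the", "and", "of", "a", "but", "so"})

with_stop = author_scores(train, query)
without_stop = author_scores(train, query, stop_words)
```

With stop-words kept, author A scores far higher than author B for the query. Once the stop-words are pruned, the two profiles become identical and the scores tie, which is the extreme version of the sensitivity the abstract reports.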
- Europe > Switzerland (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
Detecting Sockpuppetry on Wikipedia Using Meta-Learning
Raszewski, Luc, De Kock, Christine
Malicious sockpuppet detection on Wikipedia is critical to preserving access to reliable information on the internet and preventing the spread of disinformation. Prior machine learning approaches rely on stylistic and meta-data features, but do not prioritise adaptability to author-specific behaviours. As a result, they struggle to effectively model the behaviour of specific sockpuppet-groups, especially when text data is limited. To address this, we propose the application of meta-learning, a machine learning technique designed to improve performance in data-scarce settings by training models across multiple tasks. Meta-learning optimises a model for rapid adaptation to the writing style of a new sockpuppet-group. Our results show that meta-learning significantly enhances the precision of predictions compared to pre-trained models, marking an advancement in combating sockpuppetry on open editing platforms. We release a new dataset of sockpuppet investigations to foster future research in both sockpuppetry and meta-learning fields.
- Asia (0.14)
- North America > United States > Michigan (0.04)
- Media > News (0.48)
- Information Technology > Security & Privacy (0.46)
A Stylometric Application of Large Language Models
Stropkay, Harrison F., Chen, Jiayi, Latifi, Mohammad J., Rockmore, Daniel N., Manning, Jeremy R.
We show that large language models (LLMs) can be used to distinguish the writings of different authors. Specifically, an individual GPT-2 model, trained from scratch on the works of one author, will predict held-out text from that author more accurately than held-out text from other authors. We suggest that, in this way, a model trained on one author's works embodies the unique writing style of that author. We first demonstrate our approach on books written by eight different (known) authors. We also use this approach to confirm R. P. Thompson's authorship of the well-studied 15th book of the Oz series, originally attributed to L. F. Baum.
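The underlying idea, one language model per author with attribution going to the model that predicts the held-out text best, can be sketched without GPT-2. The following is a minimal illustration under stated assumptions: a character-bigram model with add-one smoothing stands in for the per-author GPT-2 models, and the training strings are invented toy data, so none of this reproduces the paper's setup.

```python
import math
from collections import Counter


class CharBigramModel:
    """Tiny per-author language model; a stand-in for GPT-2, not the paper's model."""

    def __init__(self, text, vocab_size):
        self.vocab_size = vocab_size
        self.pair_counts = Counter(zip(text, text[1:]))
        self.context_counts = Counter(text[:-1])

    def cross_entropy(self, text):
        """Mean negative log2-probability per character transition (lower is a better fit)."""
        pairs = list(zip(text, text[1:]))
        total = 0.0
        for prev, nxt in pairs:
            # Add-one smoothing keeps unseen transitions at a non-zero probability.
            prob = (self.pair_counts[(prev, nxt)] + 1) / (
                self.context_counts[prev] + self.vocab_size
            )
            total -= math.log2(prob)
        return total / len(pairs)


def attribute(models, held_out):
    """Return the author whose model assigns the held-out text the lowest cross-entropy."""
    return min(models, key=lambda author: models[author].cross_entropy(held_out))


# Invented toy "works"; the vocabulary is taken from the training text only.
training = {
    "A": "the cat sat on the mat and the cat ate the rat " * 20,
    "B": "zzyzx quartz jinx vow flux gym bloq zephyr " * 20,
}
vocab_size = len(set("".join(training.values())))
models = {author: CharBigramModel(text, vocab_size) for author, text in training.items()}

held_out_a = "the rat sat on the cat"
held_out_b = "quartz gym flux jinx"
```

Calling `attribute(models, held_out_a)` picks author A and `attribute(models, held_out_b)` picks author B on this toy data. The design choice that carries over from the abstract is the comparison itself: each candidate author gets a separate model, and predictive loss on unseen text serves as the attribution score.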
- North America > United States > New Hampshire > Grafton County > Hanover (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.70)
Cross-Genre Authorship Attribution via LLM-Based Retrieve-and-Rerank
Agarwal, Shantanu, Barry, Joel, Fincke, Steven, Miller, Scott
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that finetunes LLMs for cross-genre AA. Unlike the field of information retrieval (IR), where retrieve-and-rerank is a de facto strategy, cross-genre AA systems must avoid relying on topical cues and instead learn to identify author-specific linguistic patterns that are independent of the text's subject matter (genre/domain/topic). Consequently, for the reranker, we demonstrate that training strategies commonly used in IR are fundamentally misaligned with cross-genre AA, leading to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to effectively learn author-discriminative signals. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state-of-the-art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
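The two-stage structure and the Success@k metric can be sketched generically. Everything below is a hypothetical illustration: the scoring tables stand in for the fine-tuned LLM retriever and reranker, and Success@k is assumed to mean that the correct candidate appears among the top k results, which may differ in detail from the HIATUS definition.

```python
def retrieve_then_rerank(query, candidates, cheap_score, costly_score, shortlist_size):
    """Stage 1: shortlist by a cheap score. Stage 2: reorder the shortlist by a costly score."""
    shortlist = sorted(candidates, key=lambda c: cheap_score(query, c), reverse=True)
    shortlist = shortlist[:shortlist_size]
    return sorted(shortlist, key=lambda c: costly_score(query, c), reverse=True)


def success_at_k(ranked, gold, k):
    """1.0 if the gold candidate is among the top-k ranked results, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0


def mean_success_at_k(rankings, golds, k=8):
    """Average Success@k over a set of queries."""
    return sum(success_at_k(r, g, k) for r, g in zip(rankings, golds)) / len(golds)


# Hypothetical scores standing in for retriever and reranker model outputs.
CHEAP = {"a1": 0.9, "a2": 0.8, "a3": 0.1, "a4": 0.7}
COSTLY = {"a1": 0.2, "a2": 0.95, "a3": 0.99, "a4": 0.5}

ranked = retrieve_then_rerank(
    query="q",
    candidates=["a1", "a2", "a3", "a4"],
    cheap_score=lambda q, c: CHEAP[c],
    costly_score=lambda q, c: COSTLY[c],
    shortlist_size=3,
)
```

In this toy run the reranker promotes `a2` to the top, but `a3` never reaches it because the retriever left it off the shortlist. That dependency is why the reranker can only recover authors the first stage retrieves, and why both stages matter for the final Success@k.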
- North America > United States > California (0.14)
- Europe > Ukraine (0.14)
- Europe > Austria > Vienna (0.14)
- Government > Regional Government > North America Government > United States Government (0.46)
- Media > News (0.46)